Report: YIEDL Experiment 1 (Old vs. New Dataset)
Version: 0.1 (first draft)
1 Introduction
For the YIEDL competition we distribute a daily dataset with week-to-week targets, which has fewer features than the new dataset given to Numerai for their crypto competition. We should compare the performance of the two datasets across a variety of models and check whether it makes sense to push for the adoption of the new dataset in our competitions.
This report summarises the findings from the first experiment. The ultimate goal is to identify the key performance differences between models trained on the old (weekly) and new (daily) YIEDL datasets. For this experiment, I decided to use a large grid search instead of a few fine-tuned models. Here are my reasons:
- Fine-tuned models (trained based on my previous experience with YIEDL and Numerai) may introduce survivorship bias.
- Training models from a large grid search will likely result in three groups of models: underfitted, about right, and overfitted.
- These three groups of models roughly simulate a real-world situation where YIEDL would receive predictions from novice, intermediate and expert users.
- If most models trained using daily data show improved out-of-bag predictive performance, this would strongly indicate that daily data is effective, thus addressing the primary research question.
2 Experiment Set-up
2.1 Datasets
The following two datasets from https://yiedl.ai/competition/datasets were used:
- YIEDL Weekly Data: dataset_weekly_2025_15.zip
- YIEDL Daily Data: dataset_daily_2025_15.zip
2.2 Training vs. Test Periods
- Training: 2018-04-27 to 2022-10-31
- Embargo: 2022-11-01 to 2022-12-31 (a two-month gap period between training and test to avoid data leakage)
- Test: 2023-01-01 to 2025-04-06
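The split above amounts to simple date filtering. A minimal sketch in R (the data frame `dt` and its `date` column are stand-ins for illustration, not the actual pipeline code):

```r
# Hypothetical sketch: apply the training/embargo/test split by date filtering.
# `dt` is a stand-in data frame with one row per day.
dt <- data.frame(date = seq(as.Date("2018-04-27"), as.Date("2025-04-06"), by = "day"))

train <- dt[dt$date <= as.Date("2022-10-31"), , drop = FALSE]
test  <- dt[dt$date >= as.Date("2023-01-01"), , drop = FALSE]
# Rows from 2022-11-01 to 2022-12-31 fall inside the embargo and are used by neither set
```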
2.3 Stats
TBA (no. of rows/columns, date range, …)
2.4 Features and Targets
TBA
2.5 Grid Search
The following parameters were used to build xgboost regression models:
# Define parameters for xgboost
params <- list(objective = "reg:squarederror",  # fixed
               eta = 0.01,                      # fixed
               max_bin = 63,                    # fixed
               tree_method = "gpu_hist",        # fixed
               gpu_id = 0,                      # fixed
               booster = "gbtree",              # fixed
               max_depth = input_params$max_depth,                # see below
               max_leaves = 2^input_params$max_depth - 1,         # see below
               subsample = input_params$subsample,                # see below
               colsample_bytree = input_params$colsample_bytree)  # see below

# Train xgboost model
model_weekly <- xgb.train(params = params,                 # as shown above
                          data = training_dataset,         # training dataset for each target
                          nrounds = input_params$nrounds)  # see below

The dynamic variables for the grid search:
- max_depth: 3, 4, 5, 6, 7, 8, 9
- max_leaves: calculated using max_depth
- subsample: 0.5, 0.6, 0.7, 0.8, 0.9, 1.0
- colsample_bytree: 0.05, 0.1, 0.2, 0.3, 0.4, 0.5
- round (nrounds): 500, 1000, 1500, 2000
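The grid itself can be reproduced with expand.grid(); this is a sketch rather than the original code (the variable name `grid` is mine), but the dimensions multiply out to the 1,008 parameter combinations used throughout this report:

```r
# Sketch: build the full parameter grid (7 x 6 x 6 x 4 = 1008 combinations)
grid <- expand.grid(max_depth        = 3:9,
                    subsample        = seq(0.5, 1.0, by = 0.1),
                    colsample_bytree = c(0.05, 0.1, 0.2, 0.3, 0.4, 0.5),
                    nrounds          = c(500, 1000, 1500, 2000))
nrow(grid)  # 1008
```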
2.6 Models
The models can be categorised into four groups:
- 1008 models trained with weekly data + target_neutral -> predict on weekly data
- 1008 models trained with daily data + target_neutral -> predict on weekly data
- 1008 models trained with weekly data + target_updown -> predict on weekly data
- 1008 models trained with daily data + target_updown -> predict on weekly data
3 Predictions
Here is an example of predictions from models trained with the neutral targets:
date symbol yhat_weekly yhat_daily
<Date> <char> <num> <num>
1: 2023-01-01 ATOM 0.46437995 0.43139842
2: 2023-01-01 LTC 0.20844327 0.21108179
3: 2023-01-01 MARSH 0.09498681 0.12664908
4: 2023-01-01 UNCX 0.51846966 0.51319261
5: 2023-01-01 UNISTAKE 0.30606860 0.32849604
6: 2023-01-01 TCT 0.01451187 0.01055409
Similarly, we can look at the predictions from models trained with the updown targets:
date symbol yhat_weekly yhat_daily
<Date> <char> <num> <num>
1: 2023-01-01 ATOM 0.01370640 0.01613749
2: 2023-01-01 LTC 0.00382110 0.00447802
3: 2023-01-01 MARSH 0.00873456 0.00900494
4: 2023-01-01 UNCX 0.00514489 0.01064970
5: 2023-01-01 UNISTAKE 0.00944165 0.01341015
6: 2023-01-01 TCT 0.00662424 -0.13239994
4 Evaluation Metrics
4.1 Primary Metrics
For target_neutral, I calculated the date-wise Spearman correlation by comparing the predictions from weekly/daily models with the targets. Here is an example:
date cor_weekly cor_daily
<Date> <num> <num>
1: 2023-01-01 0.09203025 0.06804244
2: 2023-01-08 -0.04149877 -0.06153740
3: 2023-01-15 0.07896631 0.12185307
4: 2023-01-22 0.09738196 0.11143340
5: 2023-01-29 0.05406949 0.05855512
6: 2023-02-05 0.03582397 0.03980841
Similarly, here is an example of the date-wise RMSE evaluation for target_updown:
date rmse_weekly rmse_daily
<Date> <num> <num>
1: 2023-01-01 0.1931192 0.1804634
2: 2023-01-08 0.3309541 0.2766155
3: 2023-01-15 0.2755370 0.1963707
4: 2023-01-22 0.2757088 0.2634905
5: 2023-01-29 0.4911367 0.4936971
6: 2023-02-05 0.5768729 0.5564189
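Both primary metrics can be computed per date with a single data.table aggregation. A sketch, where `preds` and its `target` column are stand-ins for the real prediction/target table (randomly generated here for illustration):

```r
library(data.table)

# Stand-in prediction table: two dates, four symbols each (illustration only)
set.seed(1)
preds <- data.table(date        = rep(as.Date(c("2023-01-01", "2023-01-08")), each = 4),
                    yhat_weekly = runif(8),
                    yhat_daily  = runif(8),
                    target      = runif(8))

# Date-wise Spearman correlation (target_neutral)
cor_by_date <- preds[, .(cor_weekly = cor(yhat_weekly, target, method = "spearman"),
                         cor_daily  = cor(yhat_daily,  target, method = "spearman")),
                     by = date]

# Date-wise RMSE (target_updown)
rmse_by_date <- preds[, .(rmse_weekly = sqrt(mean((yhat_weekly - target)^2)),
                          rmse_daily  = sqrt(mean((yhat_daily  - target)^2))),
                      by = date]
```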
4.2 Secondary Metrics
Based on the primary metrics, the following secondary evaluation metrics were calculated for further analysis:
- Mean of Spearman correlation / RMSE
- Trimmed mean of Spearman correlation / RMSE (i.e. 10% trimmed from each end; this was needed to remove outliers in RMSE)
- Max drawdown (for Spearman correlation; the lower the better)
- Sharpe ratio (for Spearman correlation; the higher the better)
- Other metrics (more can be added in further analysis; this draft report covers only the metrics above for now)
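These secondary metrics are all one-liners in base R. A sketch, where `x` stands in for one model's vector of date-wise Spearman correlations (values rounded from the example in Section 4.1):

```r
# Sketch of the secondary metrics; `x` stands in for one model's vector of
# date-wise Spearman correlations
x <- c(0.092, -0.041, 0.079, 0.097, 0.054, 0.036)

mean_cor    <- mean(x)
trimmed_cor <- mean(x, trim = 0.1)             # 10% trimmed off each end
sharpe      <- mean(x) / sd(x)                 # higher is better
cum_cor     <- cumsum(x)
max_dd      <- max(cummax(cum_cor) - cum_cor)  # max drawdown: lower is better
```

Note that max drawdown is taken on the cumulative correlation series, i.e. the largest peak-to-trough drop over the test period.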
5 Comparison (Neutral)
5.1 Mean Spearman Correlation
5.1.1 Hypothesis / Expectations
Models trained with daily data should have higher mean Spearman correlation when compared to those trained with weekly data.
5.1.2 Observations (Stats)
- No. of daily models with higher mean correlation = 1008 out of 1008 (100%)
- Range of weekly models’ mean correlation (cor_mean_wkly):
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1283 0.1370 0.1387 0.1382 0.1399 0.1423
- Range of daily models’ mean correlation (cor_mean_daily):
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1340 0.1433 0.1449 0.1443 0.1459 0.1479
- Range of raw performance differences (cor_mean_daily - cor_mean_wkly):
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.002133 0.005222 0.006134 0.006158 0.007129 0.012689
- Range of percentage differences (%) (diff / cor_mean_weekly):
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.591 3.731 4.400 4.464 5.180 9.893
5.1.3 Observations (Charts)
5.1.4 Result Table
Here is the full table of the mean correlation comparison.
Notes:
- rsamp = subsample
- csamp = colsample_bytree
- round = nrounds
- cor_mean_wkly = mean correlation of weekly models’ predictions
- cor_mean_daily = mean correlation of daily models’ predictions
- diff = cor_mean_daily - cor_mean_wkly (i.e. positive differences mean the daily models are better)
- p_diff = diff / cor_mean_wkly * 100, the percentage difference (%)
6 Comparison (Updown)
6.1 Trimmed Mean RMSE
6.1.1 Hypothesis / Expectations
Models trained with daily data should have lower RMSE than those trained with weekly data. Since a few outliers are expected in the mean values, the trimmed mean (i.e. 10% of the data removed from each end) is used for this analysis.
6.1.2 Observations (Stats)
- No. of daily models with lower trimmed mean RMSE = 1008 out of 1008 (100%)
- Range of weekly models’ trimmed mean RMSE (rmse_tm_wkly):
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.4583 0.5473 0.6009 0.5956 0.6421 0.7439
- Range of daily models’ trimmed mean RMSE (rmse_tm_daily):
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.4218 0.4712 0.5039 0.5050 0.5342 0.6428
- Range of raw performance differences (rmse_tm_daily - rmse_tm_wkly):
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.14027 -0.10743 -0.09196 -0.09059 -0.07554 -0.03166
- Range of percentage differences (%) (diff / rmse_tm_wkly):
Min. 1st Qu. Median Mean 3rd Qu. Max.
-20.836 -16.974 -15.217 -15.030 -13.471 -6.203
6.1.3 Observations (Charts)
6.1.4 Result Table
Here is the full table of the trimmed mean RMSE comparison.
Notes:
- rsamp = subsample
- csamp = colsample_bytree
- round = nrounds
- rmse_tm_wkly = trimmed mean RMSE of weekly models’ predictions
- rmse_tm_daily = trimmed mean RMSE of daily models’ predictions
- diff = rmse_tm_daily - rmse_tm_wkly (i.e. negative differences mean the daily models are better)
- p_diff = diff / rmse_tm_wkly * 100, the percentage difference (%)
7 Conclusions
- A large grid search (1,008 combinations of xgboost parameters) was used for this experiment.
- Pairs of weekly and daily models (trained with the same parameters) were used to produce out-of-bag predictions on the same weekly test data (i.e. from 2023-01-01 onwards).
- Early analysis of the performance comparison suggests that daily data improves out-of-bag predictive performance.